High-dimensional regression
Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality
Avoiding overfitting is a central challenge in machine learning, yet many large neural networks readily achieve zero training loss. Here we quantify overfitting via residual information, defined as the bits in fitted models that encode noise in training data. Information-efficient learning algorithms minimize residual information while maximizing the relevant bits, which are predictive of the unknown generative models. We solve this optimization to obtain the information content of optimal algorithms for a linear regression problem and compare it to that of randomized ridge regression. Our results demonstrate the fundamental trade-off between residual and relevant information and characterize the relative information efficiency of randomized regression with respect to optimal algorithms.
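To make the object of study concrete, here is a minimal sketch (not the authors' code) of randomized ridge regression: an ordinary ridge estimator with isotropic Gaussian noise injected into the fitted weights, which is the mechanism that caps how much training noise the model can memorize. The problem sizes, the regularization `lam`, and the noise scale `tau` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-Gaussian problem: y = X @ w_star + observation noise.
n, p, sigma = 50, 20, 0.5
X = rng.standard_normal((n, p))
w_star = rng.standard_normal(p)
y = X @ w_star + sigma * rng.standard_normal(n)

def randomized_ridge(X, y, lam, tau, rng):
    """Ridge estimator plus isotropic Gaussian noise of scale tau.

    The injected noise limits how many bits the fitted weights can
    retain about the training set, including its noise (the residual
    information in the abstract's terminology).
    """
    p = X.shape[1]
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return w_ridge + tau * rng.standard_normal(p)

w_hat = randomized_ridge(X, y, lam=1.0, tau=0.1, rng=rng)
print("distance to true weights:", np.linalg.norm(w_hat - w_star))
```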
Interpretation of High-Dimensional Regression Coefficients by Comparison with Linearized Compressing Features
Joachim Schaeffer, Jinwook Rhyu, Robin Droop, Rolf Findeisen, Richard Braatz
Linear regression is often deemed inherently interpretable; however, challenges arise for high-dimensional data. We focus on further understanding how linear regression approximates nonlinear responses from high-dimensional functional data, motivated by predicting cycle life for lithium-ion batteries. We develop a linearization method to derive feature coefficients, which we compare with the closest regression coefficients along the path of regression solutions. We showcase the methods on battery data case studies where a single nonlinear compressing feature, $g\colon \mathbb{R}^p \to \mathbb{R}$, is used to construct a synthetic response, $y \in \mathbb{R}$. This unifying view of linear regression and compressing features for high-dimensional functional data helps to understand (1) how regression coefficients are shaped in the highly regularized domain and how they relate to linearized feature coefficients and (2) how the shape of regression coefficients changes as a function of regularization to approximate nonlinear responses by exploiting local structures.
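The comparison at the heart of this abstract can be sketched as follows, under illustrative assumptions: a toy compressing feature $g$ (here the log of a curve's mean power, standing in for a battery degradation feature), linearized feature coefficients obtained as a finite-difference gradient of $g$ at the data mean, and ridge coefficients computed along a small regularization path and compared by cosine similarity. None of the specific choices below come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "functional" data: n curves sampled at p points, centered near 2
# so that a first-order Taylor expansion of g is informative.
n, p = 200, 100
X = 2.0 + 0.5 * rng.standard_normal((n, p))

def g(x):
    """Illustrative nonlinear compressing feature g: R^p -> R."""
    return np.log(np.mean(x ** 2))

y = np.array([g(x) for x in X])

# Linearized feature coefficients: finite-difference gradient of g
# at the data mean (works for any black-box g).
x_bar = X.mean(axis=0)
eps = 1e-5
grad = np.array([(g(x_bar + eps * e) - g(x_bar - eps * e)) / (2 * eps)
                 for e in np.eye(p)])

# Ridge path: compare regression coefficients with the linearized ones.
Xc, yc = X - x_bar, y - y.mean()
for lam in [1e-2, 1e0, 1e2]:
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    cos = beta @ grad / (np.linalg.norm(beta) * np.linalg.norm(grad))
    print(f"lambda={lam:g}: cosine(beta, grad g at mean) = {cos:.3f}")
```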
Robust Variable Selection for High-dimensional Regression with Missing Data and Measurement Errors
The linear relationship between response variables and covariates has long been a topic of interest. The classical squared loss function usually assumes that the data obey a normal distribution. However, the data discussed in this paper contain a large number of missing values and measurement errors, so that they typically do not conform to any common distributional form. We propose a method based on an exponential squared loss function with a tuning parameter $h$. For data with different distributions, a better linear regression fit can be achieved by changing the value of $h$; for any data distribution, the loss function remains strongly robust for $h \in (0, +\infty)$. In previous studies using the traditional squared loss function, the distributional requirements on the data are very stringent, and the traditional squared loss is highly sensitive to anomalies. This reduces the estimation efficiency of the model, a drawback that becomes more pronounced when the data contain missing values with measurement errors. In contrast, the exponential squared loss can improve estimation efficiency by varying the tuning parameter $h$, adapting to a wider range of data distributions and producing more reliable estimates.

The traditional squared loss function assumes the covariates are free of missing values and measurement errors; even when such defects exist, they are either treated as absent or the affected data are removed. However, this assumption is often violated in disciplines such as health and epidemiology. As an illustration, Zhang and Zhou (1) studied a collection of breast cancer patients to identify the gene expression associated with long-term disease-free survival. The data set consists of 24,481 gene probes collected from 78 breast cancer patients. In particular, using the log-ratio values, $\log_{10}(\text{Ratio})$, denoted $Y$, it is possible to forecast disease-free survival. In practice, gene sensors inevitably introduce measurement errors, and in this breast cancer data set the $\log_{10}(\text{Ratio})$ values also contain missing entries.

When a data set contains a large number of missing values and measurement errors, ignoring them and estimating with the traditional squared loss greatly degrades the model's estimation accuracy because of the chaotic data distribution, resulting in significant estimation bias. In the above data set, we find that employing the traditional squared loss function, which handles data with measurement errors and …
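As a minimal illustration of the core idea (robust fitting only; the paper's full estimator additionally handles missingness and measurement errors), the sketch below fits linear regression under the exponential squared loss $\ell_h(r) = 1 - \exp(-r^2/h)$, whose bounded per-observation contribution limits the influence of outliers. The data, the contamination, and the choice $h = 1.0$ are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Linear data with heavy contamination (gross outliers stand in for the
# chaotic residuals induced by missingness and measurement errors).
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ beta_true + 0.3 * rng.standard_normal(n)
y[:20] += 15.0 * rng.standard_normal(20)   # gross outliers

def exp_squared_objective(beta, X, y, h):
    """Exponential squared loss: sum_i (1 - exp(-r_i^2 / h)).

    Bounded in the residual, so a single outlier contributes at most 1
    to the objective instead of r^2 as in ordinary least squares.
    """
    r = y - X @ beta
    return np.sum(1.0 - np.exp(-r ** 2 / h))

h = 1.0                                        # tuning parameter, h in (0, +inf)
beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS warm start
fit = minimize(exp_squared_objective, beta0, args=(X, y, h), method="BFGS")

print("OLS   :", np.round(beta0, 2))
print("ExpSq :", np.round(fit.x, 2))
print("truth :", beta_true)
```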
Randomized tests for high-dimensional regression: A more efficient and powerful solution
We investigate the problem of testing the global null in high-dimensional regression models when the feature dimension p grows proportionally to the number of observations n. Despite a number of prior works studying this problem, whether there exists a test that is model-agnostic, efficient to compute, and enjoys high power remains unsettled. In this paper, we answer this question in the affirmative by leveraging random projection techniques, and propose a testing procedure that blends the classical F-test with a random projection step. When combined with a systematic choice of the projection dimension, the proposed procedure is proved to be minimax optimal and, meanwhile, reduces the computation and data storage requirements. We illustrate our results in various scenarios where the underlying feature matrix exhibits an intrinsic lower-dimensional structure (such as approximate low rank or exponential/polynomial eigen-decay), and it turns out that the proposed test achieves sharp adaptive rates.
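A minimal sketch of the blended procedure, under illustrative assumptions: project the p features onto a random m-dimensional subspace with a Gaussian matrix, then apply the classical F-test for the global null on the projected design. The fixed choice m = 20 below is a placeholder; the paper's systematic, minimax-optimal rule for the projection dimension is not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def projected_f_test(X, y, m, rng):
    """F-test of the global null beta = 0 after a random projection.

    Projects the p-column design onto m Gaussian directions, then runs
    the classical F-test on the m-column projected design.
    """
    n, p = X.shape
    P = rng.standard_normal((p, m)) / np.sqrt(m)
    Z = X @ P                                  # n x m projected design
    # Residual sums of squares under the null (no features) and after
    # least-squares fitting on the projected features.
    rss0 = y @ y
    beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta_hat
    rss1 = resid @ resid
    F = ((rss0 - rss1) / m) / (rss1 / (n - m))
    return F, stats.f.sf(F, m, n - m)

# p grows proportionally with n (here p = n); the null holds: pure noise.
n = 300
X = rng.standard_normal((n, n))
y = rng.standard_normal(n)
F, pval = projected_f_test(X, y, m=20, rng=rng)
print(f"F = {F:.2f}, p-value = {pval:.3f}")    # large p-value expected
```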
Debiased high-dimensional regression calibration for errors-in-variables log-contrast models
Motivated by the challenges in analyzing gut microbiome and metagenomic data, this work tackles the issue of measurement errors in high-dimensional regression models that involve compositional covariates. This paper marks a pioneering effort in conducting statistical inference on high-dimensional compositional data that are mismeasured or contaminated. We introduce a calibration approach tailored to the linear log-contrast model. Under relatively lenient conditions on the sparsity level of the parameter, we establish the asymptotic normality of the estimator for inference. Numerical experiments and an application to a microbiome study demonstrate the efficacy of our high-dimensional calibration strategy in minimizing bias and achieving the expected coverage rates for confidence intervals. Moreover, the potential application of our proposed methodology extends well beyond compositional data, suggesting its adaptability to a wide range of research contexts.
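For orientation, here is a minimal sketch of the error-free linear log-contrast model that the calibration targets: responses regressed on log-compositions under a sum-to-zero coefficient constraint (which makes the model scale-invariant), fitted by constrained least squares via the KKT system. It omits the paper's actual contributions (measurement-error calibration, debiasing, high-dimensional inference), and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Compositional covariates: each row of X lies on the simplex; the
# linear log-contrast model regresses y on log(X) with sum(beta) = 0.
n, p = 150, 8
W = rng.gamma(shape=2.0, scale=1.0, size=(n, p))
X = W / W.sum(axis=1, keepdims=True)       # compositions
Z = np.log(X)

beta_true = np.array([1.0, -1.0, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0])  # sums to 0
y = Z @ beta_true + 0.2 * rng.standard_normal(n)

# Constrained least squares via KKT: minimize ||y - Z b||^2 s.t. 1'b = 0.
ones = np.ones(p)
kkt = np.block([[Z.T @ Z, ones[:, None]],
                [ones[None, :], np.zeros((1, 1))]])
rhs = np.concatenate([Z.T @ y, [0.0]])
beta_hat = np.linalg.solve(kkt, rhs)[:p]

print("sum of estimate:", beta_hat.sum())  # ~0 by construction
print("estimate:", np.round(beta_hat, 2))
```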